Loss and Recovery of Genetic Diversity in 
Adapting Populations of HIV 

Pleuni Pennings, Sergey Kryazhimskiy, John Wakeley 

2013 



1 Intro Paragraph / Abstract 

A population's adaptive potential is the likelihood that it will adapt in response 
to an environmental challenge, e.g., develop resistance in response to drug treat- 
ment. The effective population size inferred from genetic diversity at neutral 
sites has been traditionally taken as a major predictor of adaptive potential. 
However, recent studies demonstrate that such effective population size vastly 
underestimates the population's adaptive potential (P). 

Here we use data from treated HIV-infected patients to estimate the effective 
size of HIV populations relevant for adaptation. Our estimate is based on 
the frequencies of soft and hard selective sweeps of a known resistance mutation 
K103N. We observe that 41% of HIV populations in this study acquire resistance 
via at least two functionally equivalent but distinct mutations which sweep to 
fixation without significantly reducing genetic diversity at neighboring sites (soft 
selective sweeps). We further estimate that 20% of populations acquire a resis- 
tant allele via a single mutation that sweeps to fixation and drastically reduces 
genetic diversity (hard selective sweeps). We infer that the effective population 
size that determines the adaptive potential of within-patient HIV populations 
is approximately 1.5 x 10^5. Our estimate is two orders of magnitude higher than 
a classical estimate based on diversity at synonymous sites. Three not mutually 
exclusive reasons can explain this discrepancy: (1) some synonymous mutations 
may be under selection; (2) highly beneficial mutations may be less affected 
by ongoing linked selection than synonymous mutations; and (3) synonymous 
diversity may not be at its expected equilibrium because it recovers slowly from 
sweeps and bottlenecks. 

Our results demonstrate the utility of longitudinal genetic data for estimat- 
ing an important evolutionary parameter in non-laboratory populations. Our 
results show that the effective population size obtained from diversity at syn- 
onymous sites can be several orders of magnitude smaller than the effective 
population size relevant for adaptation. 



2 Main text 



We examine evolutionary time courses in populations of the human immun- 
odeficiency virus (HIV) within individual patients in order to understand how 
mutation and natural selection shape genetic diversity in a non-laboratory sys- 
tem. Three factors make HIV well suited to address this question: (a) HIV 
evolves rapidly due to its short replication time and high mutation rate, (b) 
HIV populations in different patients act as independent replicates of the same 
evolutionary process, and (c) the main genetic targets of positive selection in 
HIV are known, as many drug-resistance mutations have been characterized. 

We study how the genetic composition of HIV populations in 30 patients 
treated with a combination of reverse transcriptase (RT) and protease inhibitors 
changes over the course of about one year (see Ref. [5] and Suppl Mat 0). Samples 
taken prior to treatment show nucleotide diversities of ~3% at synonymous 
sites and <1% at non-synonymous sites. 

In samples taken after the initiation of treatment, we observe the fixation of 
adaptive drug-resistance alleles at several known sites in all 30 patients (Figures 
1, 2 and Suppl Mat 0), leading to substantial decreases in levels of polymorphism 
in the 1 kb region that was sequenced (Figure 3). Between the last 
sample before and the first sample after the fixation of the first drug-resistance 
mutation, viral populations lose 53% of their genetic diversity, measured by 
the median drop in per-site heterozygosity. This difference is significant, with 
P < 10^-3 (all P-values we report are from Mann-Whitney tests). No loss of 
diversity is seen in control intervals in which no resistance mutation was fixed 
(see Suppl Mat 1). We note that in 24 out of 30 patients (80%) we observe the 
fixation of a single mutation, even though the patients are treated with multiple 
drugs. 
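
As an illustration of the diversity measure used above, the following minimal Python sketch computes a per-site heterozygosity and its relative change between two samples. It assumes aligned sequences of equal length and defines heterozygosity at a site as one minus the sum of squared allele frequencies, without any sample-size correction; this definition, the function names, and the toy alignments are assumptions made for illustration and are not taken from the study. In an analysis of many patients, the median of such per-patient changes would be the summary statistic.

```python
from collections import Counter

def per_site_heterozygosity(seqs):
    """Mean over sites of 1 - sum(p_a^2), where p_a are allele frequencies."""
    n, length = len(seqs), len(seqs[0])
    total = 0.0
    for site in range(length):
        counts = Counter(seq[site] for seq in seqs)
        total += 1.0 - sum((c / n) ** 2 for c in counts.values())
    return total / length

def relative_change(before, after):
    """Relative change in heterozygosity between two samples (negative = loss)."""
    h0 = per_site_heterozygosity(before)
    return (per_site_heterozygosity(after) - h0) / h0

# Toy, made-up alignments for one hypothetical patient (not study data).
before = ["AACG", "ATCG", "AACT", "ATCT"]
after = ["AACG", "AACG", "AACG", "AACT"]
change = relative_change(before, after)
print(f"relative change in heterozygosity: {change:+.1%}")
```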

The loss of diversity that we observe is a result of hitchhiking: when an 
adaptive mutation rapidly increases in frequency, it takes with it the genetic 
background on which it arose. Three factors can attenuate the observed 
loss of diversity after a sweep. First, recombination can preserve some diversity 
by allowing sites at some distance from the selected site to escape the effects 
of the sweep (UJ). Recombinational escape does not appear to be a factor in 
these data, however, because we do not observe a correlation between post-sweep 
heterozygosity and distance from the selected site (Suppl Mat 1). This is 
consistent with previously reported observations ([5]). 

Second, the amount of diversity lost depends on the origin of the adaptive 
allele. Diversity may survive the fixation of an adaptive allele even in the absence 
of recombination if the selective sweep is "soft", i.e., if the same adaptive allele 
arises via multiple mutations on different genetic backgrounds. On the 
other hand, genetic diversity at neighboring sites is reduced dramatically if the 
selective sweep is "hard", i.e., if the adaptive allele originates through a single 
mutation on one genetic background. We demonstrate below that both soft 
and hard sweeps occurred in these patients. 

Third, new mutations can occur between the fixation of a resistance allele 
and the time when samples are taken and the sweep is observed. Since we do 
not know the actual time of fixation, we do not know the absolute amount of 
diversity loss due to the sweep. However, this does not obscure informative 
patterns of relative loss of diversity, for synonymous versus non-synonymous 
mutations and for soft versus hard sweeps. 

Importantly, whether soft or hard sweeps predominate in a population de- 
pends on the supply of adaptive mutations which is determined by the product 
of the beneficial mutation rate and the current effective population size. If adap- 
tive mutations appear in the population on average more than once per viral 
generation, they are likely to fix via soft sweeps. If adaptive mutations appear 
on average less than once per generation, they typically fix via hard sweeps. 
Thus, we can assess the supply rate of resistance mutations by determining 
whether the observed selective sweeps are soft or hard. 

In 17 of the 30 HIV populations (57%), we observe the fixation of just one, 
well-known drug resistance mutation, K103N (lysine to asparagine) in RT. This 
mutation confers high-level resistance to the non-nucleoside RT inhibitor used 
by the patients (H]). In 7 of these 17 populations (41%) both codons (AAC 
and AAT) that code for asparagine are present in the population in the first 
sample after the sweep (Figure 1). These are certain to be soft sweeps. As 
expected, the median reduction of diversity in these populations is only 15% 
and not significantly different from 0. In the remaining 10 patients (59%) only 
one codon is present after the sweep (AAC in 6 and AAT in 4 patients; Figure 
2). These sweeps could be either soft or hard. If they were all soft, the reduction 
in diversity at linked sites would be the same as in the cases for which there 
is direct evidence of a soft sweep. However, we find that diversity in these 
populations is reduced by 71% (P = 0.003), and the difference between the soft 
sweeps and potentially hard sweeps is also significant (P = 0.007). Therefore, 
some of the selective sweeps in these 10 patients must be hard sweeps (see Figure 

The fact that we observe both soft and hard sweeps implies that the K103N 
mutation appears roughly once per generation ([7]). A more accurate estimate of 
supply rate based on the frequencies of AAC and AAT alleles in the post-sweep 
samples yields 0.3 new K103N mutations per generation (95% confidence interval 
[0.17, 0.97]). Assuming a point mutation rate of 2 x 10^-6 per generation for 
transversions, we estimate the effective population size of HIV populations 
to be 1.5 x 10^5 [0.8 x 10^5, 4.8 x 10^5] (see Suppl Mat 2 for details). This 
supply rate of resistance mutations is much lower than would be expected if 
the effective population size were equal to the number of virus-infected cells 
in the body, which is estimated to be 10^8 (|9l [TIT)). However, it is much higher 
than estimates of the effective population size based on the level of diversity at 
synonymous sites assuming neutrality of synonymous variation, which would be 
around 10^3. 

We propose three possible explanations for the large difference between our 
estimate based on the frequency of soft sweeps and the estimate based on the 
traditional analyses of synonymous diversity. First, some synonymous mutations 
may be deleterious ([12]). Second, ongoing positive and negative selection may 
have a stronger effect on synonymous mutations than on beneficial mutations. 
Third, synonymous diversity may take a long time to recover after a sweep or 
a bottleneck. Next, we discuss the second and the third options in more detail. 

HIV populations consist of a heterogeneous collection of genotypes, many of 
which may carry numerous deleterious and advantageous mutations. A new mu- 
tation arises in one such genotype and is therefore linked, at least temporarily, 
to other mutations already present in it. The probability that a new muta- 
tion eventually fixes or goes extinct thus depends not only on its own selective 
effect but also on the combined fitness of the genetic background in which it 
arose ([13], [14]). The effects of linkage on the fates of new mutations are complex 
and comprise an active area of research ([13]-[18]), but the qualitative picture is 
straightforward. The fates of neutral, deleterious and weakly beneficial mutations 
are entirely determined by the background in which they arise ([14]): only 
mutations that arise on high-fitness genotypes have a chance to persist in the 
population (Figure 4). As a consequence, the number of segregating neutral 
mutations in such a population is small because high-fitness individuals comprise 
only a small fraction of the population (fH)l [T7|). At the other extreme, 
adaptive mutations with very large selective effects survive and spread irrespective 
of the genetic background in which they arise (Figure 4). The resulting 
effective population size for such mutations may be close to the census population 
size (H|). The effective population size for mutations with intermediate 
effects is somewhere in between (Figure 4). 

Next we consider the time it takes for diversity to recover after a sweep. We 
observe that diversity at both synonymous and non-synonymous sites steadily 
recovers (Figure 3). However, even after 6 to 18 months, synonymous diversity 
remains significantly depressed (median reduction -53%, P = 0.01, up 
from -66% directly after the sweep), while non-synonymous diversity is fully 
recovered (median difference +5%, not significant, up from -32% directly after 
the sweep). Hence, even relatively infrequent selective sweeps could keep 
synonymous diversity at a level that is much lower than expected. 

That stronger negative selection leads to faster recovery is consistent with 
recent predictions about the approach of the distribution of allele frequencies to 
stationarity at a single locus ([19]). This effect is well known in systems biology: 
stronger negative feedback leads to a faster recovery (|2T))). A heuristic 
analysis (Suppl Mat 3), similar to analyses in ([21], [22]), shows that neutral 
sites recover half of their diversity in roughly N generations while deleterious 
sites recover half of their diversity in roughly 1/s generations, where N is the 
effective population size and s is the strength of selection against deleterious 
mutations. Note that for reasonable values of the mutation rate, the recovery 
time is independent of the mutation rate. 

The dynamics of diversity that we have characterized are likely not specific 
to HIV but typical of any large population. For example, European humans 
have reduced synonymous diversity, but a similar amount of non-synonymous 
diversity, compared to Africans ([23], [24]). This is consistent with the explanation 
that non-synonymous diversity has had enough time to recover since the 
out-of-Africa bottleneck, but synonymous diversity has not. This observation leads to a 
counterintuitive practical implication. Since non-synonymous diversity recovers 
faster, it is usually closer to its equilibrium than synonymous diversity. Thus 
non-synonymous sites may in fact be more useful for classical population genetics 
inference than synonymous sites. 

In conclusion, we observe soft and hard sweeps in HIV, which leads to an 
estimated effective population size relevant for adaptation of around 1.5 x 10^5. 
This number is much higher than what the observed level of synonymous diver- 
sity suggests. A caveat of our approach is that we assume that the supply rate 
of beneficial mutations is similar in all patients, which may not be true. Larger 
samples are needed to estimate the supply rate per patient, and determine how 
much variation there is between patients. Our results confirm what has already 
been predicted theoretically: a single effective population size cannot 
describe all aspects of the behavior of a population ([T1 I25]). 

Figure 1: Soft sweep in patient 058. Codon AAA coding for lysine at position 
103 was replaced by day 28 with a mixture of codons AAT and AAC, both 
coding for asparagine which confers resistance to NNRTI drugs. Genetic diver- 
sity close to the selected site was not significantly reduced. The plot shows only 
the polymorphic sites among the first 500 basepairs of the reverse transcriptase 
region. Each row represents a sequenced viral isolate. Each column represents a 
polymorphic site, with the derived synonymous and non-synonymous polymor- 
phisms shown in black and red respectively. Codon 103 is shown explicitly. 

Figure 2: Putatively hard sweep in patient 086. Codon AAA coding for lysine at 
position 103 was replaced by day 84 with codon AAT. Genetic diversity around 
the selected site was strongly reduced. Notation as in Figure 1. 

Figure 3: Nucleotide diversity over time in 30 patients. Nucleotide diversity 
at synonymous (black) and non-synonymous (red) sites. The time point 
"before fixation" is the last sample in each patient taken before the observed 
fixation of a resistance mutation. The time point "at fixation" is the first sample 
in each patient in which the drug resistance mutation is observed to be fixed. 
The third and the fourth time points denote samples in each patient taken 1- 
180 days and 181-700 days after the observed fixation of the resistance mutation 
respectively. 

Figure 4: Fates of mutations in a heterogeneous population. This 
schematic shows the fates ("fix", fixation; "ext" extinction) of mutations with 
different selective effects, as indicated in each panel, depending on the fitness of 
the genotype they occur in. The light gray area shows the distribution of fit- 
nesses in the population. Small gray arrows show how natural selection changes 
the distribution of genotypes: fitter than average genotypes increase in fre- 
quency, less fit than average genotypes decrease in frequency. Colored arrows 
show the effect size of new mutations: neutral mutations (effect size 0) are shown 
in blue, adaptive mutations (effect size > 0) are shown in orange. Mutations 
that are destined for fixation (extinction) are shown with solid (dashed) arrows. 
Mutations with a large beneficial effect can fix if they arise in any background, 
whereas mutations with a small beneficial effect can fix only if they arise in a 
very fit background. The resulting effective population size for mutations with 
a given effect size is shown as red shading. 


4 Supporting Information 



4.1 Suppl Mat 0. 

Out of a larger dataset, we used viral sequences from patients for which 
we had samples at two consecutive time-points that satisfy the following two 
criteria. (1) At the first time-point, no known resistance mutation to any of the 
drugs used in the trial was present at more than 30% frequency in the sample. 
See below for the list of drugs and mutations. (2) At the next time-point, at 
least one drug-resistant allele increases in frequency by at least 70%. There were 
30 such patients. In most cases (26 out of 30) the frequency of the mutation 
changes from 0 to 100%. The four exceptions are patient 89 (mutation G190S 
changed from 0 to 75%), patient 168 (mutation K103N changed from 0 to 
83%), patient 22 (mutation K103N changed from 13 to 100%, while mutation 
V82A changed from 0 to 100%), and patient 81 (mutation K103N changed 
from 14 to 100%). In most cases (24 out of 30) only a single drug-resistance 
mutation went to fixation (2 mutations in patients 22, 56, 87, 91, 154 and 166). 

The patients were treated with Zidovudine, Lamivudine, Efavirenz and In- 
dinavir (2). There are many known mutations that confer resistance to one or 
more of these drugs. We are interested in the fixation of the first resistance mu- 
tation in a viral population. We used the following list of major drug-resistance 
mutations (the number is the codon and the letter is the amino acid that confers 
resistance). Protease: 46IL, 82AFT, 84V; Reverse Transcriptase: 41L, 62V, 
65R, 67N, 70R, 75I, 77L, 100I, 101P, 103N, 106MA, 108I, 116Y, 151M, 181CI, 
184VI, 188LCH, 190SA, 210W, 215YF, 219QE, 225H (gSJ). 

Our dataset consists of coding sequences of the regions of the Pol gene that 
encode protease (all 297 basepairs) and reverse transcriptase (first 689 basepairs). 
We analyze separately all third codon position sites and all first and second 
codon position sites. Most observed mutations at third codon position sites 
have no effect on the amino acid, and we expect these synonymous mutations 
to be neutral or nearly neutral. Throughout the paper, we refer to mutations 
at first and second position sites as non-synonymous and mutations at third 
position sites as synonymous. Most mutations at the first or second codon 
positions change the amino acid, and we expect that most such non-synonymous 
mutations are selected against. 

No change in genetic diversity was observed in 27 control intervals that also 
started with drug susceptible virus but in which no fixation of drug resistant 
amino-acids occurred (largest frequency change of drug resistant amino-acid in 
these intervals was 30%). 

We found no difference in pre-sweep diversity at third codon position sites 
between hard sweep and soft sweep patients. 

4.2 Suppl Mat 1. 

To determine whether recombination affects the reduction of diversity, we split 
the sequences into a part close to the selected site (less than 50, 100, or 200 
basepairs distance from the selected site) and a part far from the selected site. 
We found no difference in the amount of diversity loss in the close versus far 
sites. Earlier work suggests that selection may be so strong, relative to the 
recombination rate, that the sweep can affect the entire genome ([51 |2"T)). 

4.3 Suppl Mat 2. 

We can use the counts of the two beneficial alleles (AAT and AAC at the 103rd 
amino acid of RT) in the post-sweep samples to estimate the supply rate 
of resistance mutations. We assume that the mutation rate $\mu_{AAT}$ to the AAT 
codon and the mutation rate $\mu_{AAC}$ to the AAC codon are equal to each other, 
$\mu_{AAT} = \mu_{AAC} = \mu/2$. If $N$ is the effective population size, then $\theta = 2N\mu$, and 
$0.5 \times \theta$ is the total population-wide beneficial mutation rate, which we also call 
"the supply rate of resistance mutations". 

Theory predicts that the population frequency $x$ of the AAC allele follows a 
beta-distribution with parameters $(\theta/2, \theta/2)$ (see ([7])). The number $k$ of AAC 
alleles in a sample of size $n$ follows the binomial distribution with parameters 
$n$ and $x$. Thus, the likelihood of observing $k_1, \dots, k_M$ AAC alleles in samples 
of size $n_1, \dots, n_M$ in $M$ patients is given by the product of beta-binomial 
distributions, 

$$L(k_1, \dots, k_M; \theta) = \prod_{i=1}^{M} \binom{n_i}{k_i} \frac{B(k_i + \theta/2,\; n_i - k_i + \theta/2)}{B(\theta/2,\; \theta/2)}, \qquad (1)$$

where $B(a, b)$ is the beta-function. By maximizing expression (1), we obtain the 
maximum likelihood estimate $\theta = 0.62$. The 95% bootstrap confidence interval 
[0.34, 1.94] is obtained by resampling patients. The estimated supply rate of 
K103N mutations into the population is $0.5 \times \theta = 0.31$. 
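
To make the estimation step concrete, here is a minimal Python sketch of maximizing the beta-binomial likelihood in expression (1). It is an illustration under stated assumptions, not the code used for the analysis: the function names, the log-spaced grid search, and the (k, n) counts are all hypothetical, and the counts are made up rather than taken from the patients.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_likelihood(theta, samples):
    """Log of expression (1); samples is a list of (k_i, n_i) pairs."""
    half = theta / 2.0
    total = 0.0
    for k, n in samples:
        total += math.log(math.comb(n, k))
        total += log_beta(k + half, n - k + half) - log_beta(half, half)
    return total

def mle_theta(samples, lo=1e-3, hi=50.0, steps=20000):
    """Crude grid search on a log-spaced grid (an assumed, simple optimizer)."""
    best_theta, best_ll = lo, -math.inf
    for i in range(steps + 1):
        theta = lo * (hi / lo) ** (i / steps)
        ll = log_likelihood(theta, samples)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Made-up (k, n) counts of AAC alleles per patient; NOT the study data.
toy_samples = [(0, 6), (6, 6), (3, 6), (0, 5), (2, 7), (8, 8)]
theta_hat = mle_theta(toy_samples)
print(f"theta_hat = {theta_hat:.3f}, supply rate = {theta_hat / 2:.3f}")
```

A bootstrap interval of the kind described above could be sketched by resampling the list of patients with replacement and repeating the maximization for each resample.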

We estimate the effective population size as $\theta/(2\mu)$. Because we only consider 
the K103N mutation, $\mu \approx 2 \times 10^{-6}$ (0[28]). Note that this number is lower than 
the typical per site mutation rate, because most (~85%) mutations in HIV are 
transitions, whereas the A to C and A to T mutations that create the K103N 
change are both transversions. We thus estimate the effective population size 
to be $1.5 \times 10^5$ (95% confidence interval $[0.8 \times 10^5, 4.8 \times 10^5]$). 
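
The conversion from the estimated parameter to an effective population size is simple arithmetic, sketched below in Python. The numerical inputs are the values quoted in the text; the variable and function names are illustrative assumptions.

```python
theta_hat = 0.62          # maximum likelihood estimate quoted in the text
theta_ci = (0.34, 1.94)   # 95% bootstrap interval quoted in the text
mu = 2e-6                 # mutation rate assumed in the text for K103N

def effective_size(theta, mu):
    """N = theta / (2 * mu), obtained by rearranging theta = 2 * N * mu."""
    return theta / (2.0 * mu)

n_hat = effective_size(theta_hat, mu)
n_ci = tuple(effective_size(t, mu) for t in theta_ci)
print(f"N = {n_hat:.3g}, interval = ({n_ci[0]:.3g}, {n_ci[1]:.3g})")
```

The point estimate comes out near 1.5 x 10^5, and the interval endpoints are close to the rounded bounds reported above.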

For a given sample size and a given supply rate of beneficial mutations, we 
can also predict the probability that the beneficial mutation originates from 1, 
2 or more mutational origins following Ref. [7]. For a sample of size 6 (the median 
in the dataset) and the estimated supply rate of beneficial mutations of 0.31, 
the predicted probability that all observed beneficial alleles in the sample stem 
from a single origin is 30%, the probability that they stem from 2 origins is 43%, 
3 origins: 22% and 4, 5 or 6 origins: 6%. This shows that even if the supply 
rate is exactly the same in all populations, the observed sweep signature can 
vary widely. 
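
The percentages above can be reproduced with a short Python sketch, under the assumption that the number of independent origins in a sample follows the Ewens sampling distribution with parameter $\theta = 2 \times 0.31 = 0.62$. Whether Ref. [7] computes the probabilities in exactly this way is an assumption made here; the function name is hypothetical.

```python
def origin_distribution(n, theta):
    """P(K = k) for k = 1..n distinct origins in a sample of size n (Ewens)."""
    # Unsigned Stirling numbers of the first kind, built with the recurrence
    # c(m + 1, k) = m * c(m, k) + c(m, k - 1), starting from c(1, 1) = 1.
    c = [0, 1]
    for m in range(1, n):
        nxt = [0] * (m + 2)
        for k in range(1, m + 2):
            prev_same = c[k] if k <= m else 0
            nxt[k] = m * prev_same + c[k - 1]
        c = nxt
    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = 1.0
    for i in range(n):
        rising *= theta + i
    return {k: c[k] * theta ** k / rising for k in range(1, n + 1)}

probs = origin_distribution(n=6, theta=0.62)
four_or_more = sum(probs[k] for k in (4, 5, 6))
for k in (1, 2, 3):
    print(f"{k} origin(s): {probs[k]:.0%}")
print(f"4-6 origins: {four_or_more:.0%}")
```

With these inputs the sketch gives roughly 30%, 43%, 22% and 6%, matching the values quoted in the text.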






4.4 Suppl Mat 3. 

Because the patients were followed for approximately one year (median; approximately 
200 generations ([2T)f)), it is possible to determine whether the viral populations 
in the patients recover from the observed selective sweep. We binned the ob- 
servations in three bins. Bin 1: directly after the fixation event (the day the 
resistance mutation was detected, most likely this is shortly after it was fixed 
and allowed an increase in viral load). Bin 2: observations between 1 and 180 
days after the fixation was observed. Bin 3: between 181 and 700 days after the 
fixation was observed. If a second drug resistance mutation became fixed in the 
virus of a patient the sample in which this was observed and any later samples 
were removed from the analysis. 

In order to model the recovery of heterozygosity, we assume that during the 
periods between sweeps of resistance mutations there is no positive or balanc- 
ing selection, only mutation, random drift, and negative selection. We further 
assume that negative selection on linked sites simply reduces the effective pop- 
ulation size and may thus be captured by the random drift term in standard, 
single-locus population-genetic models ([3"U]). 

We assume two alleles: the 'non-mutant' $A_1$ and the 'mutant' $A_2$, with 
relative fitnesses $1$ and $1 - s$, respectively. With probability $u$, $A_1$ mutates to $A_2$, 
and with probability $v$, $A_2$ mutates to $A_1$. Reproduction occurs according to the 
haploid Wright-Fisher model, with non-overlapping generations and constant 
population size $N$. If the current frequency of the mutant is $x$, then 

$$x'' = \frac{x(1 - s)(1 - v) + (1 - x)u}{1 - sx} \qquad (2)$$

gives the frequency after selection and mutation. The number of mutants in the 
next generation is binomial with parameters $N$ and $x''$, so that its frequency $X$ 
has expectation $E[X] = x''$ and variance $\mathrm{Var}[X] = x''(1 - x'')/N$. 

We are interested in the recovery of heterozygosity, and so consider 

$$E[\Delta H] = E[2X(1 - X) - 2x(1 - x)] \qquad (3)$$

$$= 2x''(1 - x'')\left(1 - \frac{1}{N}\right) - 2x(1 - x), \qquad (4)$$

where the $(1 - 1/N)$ term reflects the loss of heterozygosity due to identity by 
descent (coalescence). We seek a simple heuristic formula that will aid in 
understanding the recovery of heterozygosity after a selective sweep. Assuming 
that $s$, $u$, $v$, and $1/N$ are all small, which is appropriate for HIV, and further 
assuming that the mutant frequency $x$ is small, gives 

$$E[\Delta H] \approx 2u - \left(s + 3u + v + \frac{1}{N}\right) H, \qquad (5)$$

in which $H = 2x(1 - x) \approx 2x$. We apply this result to the recovery of 
heterozygosity at synonymous and non-synonymous sites heuristically using a 
continuous-time approximation. 

For synonymous sites, we assume that all mutations are neutral and that 
$u, v \ll 1/N$. Then $dH/dt = 2u - H/N$, and if $H(0) = 0$, we have 
$H(t) = 2Nu(1 - \exp(-t/N))$. The response time, $t_{half}$, defined as the time it takes for $H$ to 
recover 50% of its loss of diversity, can be found by solving $0.5 = \exp(-t_{half}/N)$, 
giving $t_{half} = N \log 2$. Note that under the assumption $u, v \ll 1/N$, the response 
time is independent of the mutation rates and it is possible to estimate $N$ 
independently from $u$. For non-synonymous sites, we assume that selection is 
stronger than both random drift and mutation, or $s \gg u, v, 1/N$, and solve $dH/dt = 2u - sH$ to 
obtain $H(t) = 2(u/s)(1 - \exp(-st))$. In this case the response time is 
$t_{half} = s^{-1} \log 2$. Thus $s$ may be estimated independently from $u$. These expressions for 
$t_{half}$ illustrate that non-synonymous sites will recover faster than synonymous 
sites if $s > 1/N$. Figure 5 shows that these approximate, heuristic expressions 
agree well with simulations as long as the mutation rate is sufficiently small. 
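
As a rough numerical check of these response times, the Python sketch below iterates the deterministic recursion implied by equation (5), $H \to H + 2u - (s + 3u + v + 1/N)H$, starting from $H = 0$, and counts the generations needed to reach half of the equilibrium value. It is not a stochastic Wright-Fisher simulation, and the parameter values and function name are assumptions chosen only for illustration.

```python
import math

def half_recovery_time(N, s, u, v, max_gen=10_000_000):
    """Generations until H first reaches half its equilibrium under (5)."""
    rate = s + 3 * u + v + 1.0 / N   # coefficient multiplying H in (5)
    h_eq = 2 * u / rate              # fixed point of the recursion
    h = 0.0
    for t in range(1, max_gen + 1):
        h += 2 * u - rate * h
        if h >= 0.5 * h_eq:
            return t
    raise RuntimeError("no recovery within max_gen generations")

# Illustrative parameter values (assumed, not taken from the study).
neutral = half_recovery_time(N=1000, s=0.0, u=1e-6, v=1e-6)
selected = half_recovery_time(N=100_000, s=0.01, u=1e-6, v=1e-6)
print("neutral: ", neutral, "vs N log 2 =", 1000 * math.log(2))
print("selected:", selected, "vs log 2 / s =", math.log(2) / 0.01)
```

For these values the iterated half-times fall within a few percent of $N \log 2$ and $s^{-1} \log 2$ respectively, as expected when the mutation rates are small relative to $1/N$ or $s$.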

Song and Steinrücken ([19]) have recently described a method for studying 
the approach to stationarity of the distribution of allele frequencies, and also 
illustrated that recovery is faster when selection is stronger. 

Figure 5: Comparison of the response time $t_{half}$, defined as the time it takes 
for $H$ to recover 50% of its loss of diversity, from the heuristic model and from 
simulations. Dashed lines show $t_{half} = s^{-1} \log 2$ (for $s > 0$) or $t_{half} = N \log 2$ 
(for s = 0). Dots display averages over 1000 simulations of a haploid Wright- 
Fisher population with N = 10 individuals for each pair of values of u and s. 
In all cases, v = u. 